Exploring habitat preference and the phylogenetic tree of Myxococcota

Author

Daniel Padfield

Published

April 4, 2023

Outline

This document plots the phylogenetic tree and visually examines the congruence between the tree and their taxonomic assignments. It then plots the habitat preference for each species/tip on the tree and we look at some simple descriptive statistics of the the habitat preferences.

Finally some key files used in other walk-throughs are saved out in this analysis for easy reading in for other scripts.

Plot the tree with taxonomic information

Firstly we will plot the tree and highlight the taxonomic information we have for each tip.

Load in R packages

Load in the R packages I am going to use.

Code
# load packages - use librarian::shelf()
# need to ensure BiocManager and Biobase are installed - BiocManager::install("Biobase")
# can install packages from GitHub, Bioconductor, and CRAN all at once
librarian::shelf(here, tidyverse, ggtree, ggnewscale, RColorBrewer, patchwork, ape, phytools, MetBrewer)

Load in data and do some data wrangling

There are three datasets that need aggregating for this analysis.

  1. The ultrametric phylogenetic tree
  2. The taxonomic information for each ASV
  3. The information on habitat preferences for each ASV

We can load each in and then do some data wrangling.

Code
# set where I am in the project
here::i_am('scripts/sequencing_rpoB/analyses/phylogenetics/plot_phylogeny.qmd')

# load in the tree
tree <- ape::read.tree(here('data/sequencing_rpoB/raxml/trees/myxo_asv/myxo_asv_chronopl10.tre'))

# read in habitat preference
d_habpref <- read.csv(here('data/sequencing_rpoB/phyloseq/myxococcus/habitat_preference/summary/habitat_preference_asv.csv'))

# read in phyloseq object and grab tax table
d_taxa <- readRDS(here('data/sequencing_rpoB/phyloseq/myxococcus/prevalence_filtered/ps_otu_asv_filt.rds')) %>%
  phyloseq::tax_table() %>%
  data.frame() %>%
  janitor::clean_names() %>%
  rownames_to_column('otu')

# create d_meta
d_meta <- left_join(select(d_habpref, otu, habitat_preference, num_present), select(d_taxa, otu:family)) %>%
  mutate(family2 = ifelse(is.na(family), 'uncertain', family))
  
# set habitat colours
cols_hab <- met.brewer('Austria', n = 7) %>% as.character()
cols_labels <- c('marine_mud', 'freshwater', 'terrestrial', 'freshwater:terrestrial', 'freshwater:marine_mud:terrestrial', 'marine_mud:terrestrial', 'freshwater:marine_mud')

# create a named vector of the colours to match up with d_meta
names(cols_hab) <- cols_labels

# create new habitat labels
hab_labels <- c('marine mud specialist', 'freshwater specialist', 'terrestrial specialist', 'freshwater + terrestrial generalist', 'full generalist', 'marine mud + terrestrial generalist', 'freshwater + marine mud generalist')
names(hab_labels) <- cols_labels

# sort the new labels to be in the correct order
hab_labels <- hab_labels[sort(names(hab_labels))]

Plot the tree with taxonomy

We can now plot the taxonomic information on a plot with the habitat preferences to visualise the phylogenetic tree. This uses the R packages ggtree and ggnewscale to plot the tree and combine multiple colour scales respectively.

We constrained our tree based on the phylogeny on a new consensus single marker phylogeny of all the delta proteobacteria (Waite et al. 2020), so we will colour by the families that were used in the constraint tree. These families we used to constrain the tree were those with common, established names in the GTDB database.

Code
# constrained families
constrained_families <- c('Myxococcaceae', 'Vulgatibacteraceae', 'Anaeromyxobacteraceae', 'Polyangiaceae', 'Sandaracinaceae', 'Nannocystaceae', 'Haliangiaceae')

# find 9 most common families based on number of tips assigned to them
d_common <- d_meta %>%
  group_by(family2) %>%
  tally() %>%
  ungroup() %>%
  mutate(., prop = n / sum(n)) %>%
  filter(., family2 %in% constrained_families)
  
sum(d_common$prop)
[1] 0.7363602

The families we used to constrain the tree represent 74% of all the tips.

We can now add these into the meta dataframe as a new column. Anything that is not assigned to one of these families will be assigned “unconstrained”, and any ASVs that were not assigned to the family level will be assigned “uncertain”. To colour branches of the tree, we need to use ggtree::groupOTU().

Code
# add in column for colouring branches
d_meta <- mutate(d_meta, family_common = ifelse(family2 %in% c(constrained_families, 'uncertain'), family2, 'unconstrained')) %>%
  rename(tip_label = otu)

# group tip labels together in terms of their order
to_group <- split(d_meta$tip_label, d_meta$family_common)
tree2 <- groupOTU(tree, to_group)

We can also find the MRCA ancestor node of each of our constrained families. This can then be used to find colour the outside of our plot with clade colours.

Code
# find the mrca of each of the clades
d_meta2 <- filter(d_meta, family_common %in% constrained_families) %>%
  select(family, tip_label) %>%
  group_by(family) %>%
  nest() %>%
  ungroup()

for(i in 1:nrow(d_meta2)){
  d_meta2$mrca[i] <- findMRCA(tree2, tips = d_meta2$data[[i]]$tip_label)
}

d_meta2 <- select(d_meta2, family2 = family, mrca) %>%
  mutate(blank_label = '')

We are now ready to plot the phylogenetic tree. We will plot the tree with different colours for each family, and then place the colours of the habitat preferences around the tips of the tree.

Code
# create colour palettes

# for different families - add in black and grey for uncertain and uncommon respectively
cols <- c(colorRampPalette(brewer.pal(11, "Spectral"))(nrow(d_common)), 'black', 'grey')
names(cols) <- c(sort(d_common$family2), 'uncertain', 'unconstrained')

# make a separate column for the rare states to make their size bigger!
d_meta <- mutate(d_meta, rare = ifelse(habitat_preference %in% c('freshwater:marine_mud:terrestrial', 'marine_mud:terrestrial', "freshwater:marine_mud"), 'rare', 'common'))

# first only plot taxonomy, also make non circular so we can see groupings properly
tree_plot <- ggtree(tree2, aes(col = group)) %<+% filter(d_meta) +
  scale_color_manual('Family (branch colours)', values = cols) +
  guides(color = guide_legend(override.aes = list(linewidth = 3))) +
  geom_cladelab(data = d_meta2,
                mapping = aes(node = mrca,
                              color = family2,
                              label = blank_label),
                offset = 0.1,
                barsize = 2,
                show.legend = FALSE)

tree_plot

Code
# save plot out
#ggsave(here('sequencing_rpoB/plots/analyses/tree_all_taxonomy.pdf'), tree_plot2, height = 9, width = 12)
ggsave(here('plots/sequencing_rpoB/analyses/tree_check_taxonomy.png'), tree_plot, height = 10, width = 5)

This shows that all of the families specified in our constraint tree are grouped together.

We can now easily plot this as a circular tree.

Code
# first only plot taxonomy, also make non circular so we can see groupings properly
tree_plot <- ggtree(tree2, layout = 'circular', branch.length = 'none', aes(col = group)) %<+% filter(d_meta) +
  scale_color_manual('Family (branch colours)', values = cols) +
  guides(color = guide_legend(override.aes = list(linewidth = 3))) +
  geom_cladelab(data = d_meta2,
                mapping = aes(node = mrca,
                              color = family2,
                              label = blank_label),
                offset = 10,
                barsize = 2,
                show.legend = FALSE)

tree_plot

Plot the tree with habitat preference information

We can now add information about the habitat preference of our ASVs to the dataset. But first lets describe our habitat preference trait.

What is our trait and what is its distribution?

Our trait is our own definition of habitat preference for each myxococcota ASV. Briefly, for each ASV we:

  • Calculated proportions of occurrence in each of the three habitat clusters identified from the 16S data (broadly these were terrestrial, freshwater, and marine mud).
  • If you only turned up in a single cluster, you were assigned as being a specialist in that habitat.
  • If you turned up in more than one cluster, we used a bootstrapping approach to see assign habitat preference:
    • Bootstrapped a distribution of proportions of occurrences in each habitat by re-sampling 100 presences based on the actual used proportions 1000 times.
    • Compared these to the availability of each habitat cluster (different clusters had different total numbers of samples assigned to them)
    • If an ASV’s use of a habitat was significantly less than expected from the cluster’s availability (0.975 quantile of bootstrapped use was less than the proportion available) we defined that habitat as unfavourable.
    • Habitat preferences was defined as all the habitats you use at least as much as is expected by their availability.
  • Overall, an ASV could be assigned to be a specialist in freshwater, terrestrial, or marine mud, or be a generalist in any of these three habitats.

We can look at the numbers of each of these and how many there are in each group.

Code
# first lets look at how many are in each group
group_by(d_meta, habitat_preference) %>%
  summarise(n = n(),
            ave_present = median(num_present)) %>%
  arrange(desc(n)) %>%
  mutate(prop = round(n/sum(n)*100, 2))
# A tibble: 7 × 4
  habitat_preference                    n ave_present  prop
  <chr>                             <int>       <dbl> <dbl>
1 freshwater                          738         5   28.2 
2 terrestrial                         704         5   26.9 
3 marine_mud                          568         5   21.7 
4 freshwater:terrestrial              488         6   18.6 
5 freshwater:marine_mud               112         6    4.27
6 freshwater:marine_mud:terrestrial     6         5.5  0.23
7 marine_mud:terrestrial                5         5    0.19

We only have 6 ASVs that have been classified as full generalists, and only 5 that are marine mud + terrestrial generalists. These are only 0.23% and 0.15% of all the ASVs. This will likely make them very, very difficult to include in our comparative phylogenetic analyses. The next smallest group of ASVs are freshwater + terrestrial generalists (4.4% of ASVs), so it makes sense to group all of these together as “marine mud generalists”. We will have a look at doing this later on in the script.

We can also see how many of our ASVs just turned up in a single cluster and which of those turned up in multiple.

Code
d_habpref %>%
  group_by(habitats_present) %>%
  tally() %>%
  mutate(prop = round(n/sum(n), 2))
# A tibble: 3 × 3
  habitats_present     n  prop
             <int> <int> <dbl>
1                1  1448  0.55
2                2   939  0.36
3                3   234  0.09
Code
d_habpref %>%
  group_by(habitats_present, habitat_preference) %>%
  tally() %>%
  group_by(habitat_preference) %>%
  mutate(prop = round(n/sum(n), 2)) %>%
  arrange(habitat_preference, habitats_present)
# A tibble: 14 × 4
# Groups:   habitat_preference [7]
   habitats_present habitat_preference                    n  prop
              <int> <chr>                             <int> <dbl>
 1                1 freshwater                          329  0.45
 2                2 freshwater                          334  0.45
 3                3 freshwater                           75  0.1 
 4                2 freshwater:marine_mud                99  0.88
 5                3 freshwater:marine_mud                13  0.12
 6                3 freshwater:marine_mud:terrestrial     6  1   
 7                2 freshwater:terrestrial              358  0.73
 8                3 freshwater:terrestrial              130  0.27
 9                1 marine_mud                          527  0.93
10                2 marine_mud                           41  0.07
11                2 marine_mud:terrestrial                5  1   
12                1 terrestrial                         592  0.84
13                2 terrestrial                         102  0.14
14                3 terrestrial                          10  0.01

Firstly we can see that 50% of our ASVs are only ever present in one habitat, compared to 40% in two habitats, and 9/10% in all three habitats.

Second we can see that within our freshwater specialists, 44% of them only occur in the freshwater cluster, and 56% of them occur in more than a single habitat, this is a much higher number than for marine mud specialists (30% also occur in other habitats) and terrestrial specialists (14% occur in other habitats).

This may have something to do with freshwater having the lowest number of samples in their cluster, but I cannot work out why that would be right now.

Plot tree with habitat preference

We can now add the habitat preference information round the outside of the phylogenetic tree. We will make the rare states (full generalist, marine mud + terrestrial generalist, and freshwater + marine mud generalist) larger so we can see how they are distributed on our tree.

Code
# reuse tree made above
# now plot habitat preference around the tips
tree_plot2 <- tree_plot +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank()) +
  new_scale_color() +
  geom_tippoint(aes(x=x+5, col = habitat_preference, size = rare), position = position_jitter(width = 3, height = 0)) +
  scale_size_manual(values = c(0.6, 3)) +
  scale_color_manual('Habitat preference (tip points)', values = cols_hab, labels = hab_labels) +
  guides(color = guide_legend(override.aes = list(size = 5)),
         size = 'none')

tree_plot2 

Code
# save plot out
#ggsave(here('sequencing_rpoB/plots/analyses/tree_all_taxonomy.pdf'), tree_plot2, height = 9, width = 12)
ggsave(here('plots/sequencing_rpoB/analyses/tree_all_taxonomy.png'), tree_plot2, height = 9, width = 12)

Ok this looks pretty darn amazing. There are a couple of things to take home from this.

  • Marine mud clades look very well conserved
  • The other states look less well conserved
  • The two rarest states (full generalist and marine mud + terrestrial generalist) do not really cluster together. This may make them very difficult to include in models.

Plot tree after collapsing our rare states into a single trait

As a consequence of how rare it is to be a marine mud generalist, we will visualise how the tree looks if we add a column to collapse these into a single trait.

Code
# create new habitat preference vector
d_habpref <- mutate(d_habpref, habitat_preference2 = ifelse(habitat_preference %in% c('freshwater:marine_mud:terrestrial', 'freshwater:marine_mud', 'marine_mud:terrestrial'), 'marine_mud_generalist', habitat_preference),
                    # rename all habitat preference vectors for easy renaming
                    habitat_preference3 = case_when(habitat_preference2 == 'marine_mud_generalist' ~ "marine mud generalist",
                                                    habitat_preference2 == 'marine_mud' ~ "marine mud specialist",
                                                    habitat_preference == "terrestrial" ~ "terrestrial specialist",
                                                    habitat_preference2 == "freshwater" ~ "freshwater specialist",
                                                    habitat_preference2 == "freshwater:terrestrial" ~ "freshwater + terrestrial generalist"))

d_habpref %>% group_by(habitat_preference3) %>%
  tally() %>%
  arrange(-n)
# A tibble: 5 × 2
  habitat_preference3                     n
  <chr>                               <int>
1 freshwater specialist                 738
2 terrestrial specialist                704
3 marine mud specialist                 568
4 freshwater + terrestrial generalist   488
5 marine mud generalist                 123

We can now replot this tree but we need to remake d_meta and also set the colour scheme up again, this time for five colours.

Code
# remake colour scheme
cols_hab2 <- cols_hab[names(cols_hab) %in% c('marine_mud', 'freshwater', 'terrestrial', 'freshwater:terrestrial')]
cols_hab2 <- c(cols_hab2, '#721b3e')
names(cols_hab2) <- c('marine mud specialist', 'freshwater specialist', 'terrestrial specialist', 'freshwater + terrestrial generalist', 'marine mud generalist')

# create d_meta
d_meta_new <- left_join(select(d_habpref, otu, habitat_preference3, num_present), select(d_taxa, otu:family)) %>%
  mutate(rare2 = ifelse(habitat_preference3 == "marine mud generalist", 'rare', "common"))

tree_plot3 <- tree_plot %<+% d_meta_new +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank()) +
  new_scale_color() +
  geom_tippoint(aes(x=x+5, col = habitat_preference3, size = rare2), position = position_jitter(width = 3, height = 0)) +
  scale_color_manual('Habitat preference (tip points)', values = cols_hab2) +
  scale_size_manual(values = c(0.6, 3)) +
  guides(color = guide_legend(override.aes = list(size = 5)),
         size = 'none')

tree_plot3

Code
# save out
ggsave(here('plots/sequencing_rpoB/analyses/tree_habpref_combined.png'), tree_plot3, height = 9, width = 12)

Finally we will save out the new habitat preference vector where we have combined the habitat preference states.

Code
# save out habitat preference vector
write.csv(d_habpref, here('data/sequencing_rpoB/phyloseq/myxococcus/habitat_preference/summary/habitat_preference_asv_new.csv'), row.names = FALSE)
saveRDS(cols_hab2, here('data/sequencing_rpoB/phyloseq/myxococcus/habitat_preference/summary/habitat_colours.rds'))